MetaTransformer: deep metagenomic sequencing read classification using self-attention models

Abstract Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. In language modeling, transformers have outperformed previous approaches, and applying deep learning to the analysis of genomic sequences is therefore promising, already yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on a complex model based on bidirectional long short-term memory cells, resulting in slow runtimes and excessive memory requirements that hamper its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification ability. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we achieve a 2x to 5x speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate the performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.


INTRODUCTION
About 1,000 bacterial species live in the human gut, exhibiting 150 times more genes than the human genome (1). Besides aiding the digestive system, a relation to the immune system has also been documented (1,2), suggesting that imbalances in the gut microbiome can result in autoimmune and allergic diseases such as type 1 diabetes, multiple sclerosis, inflammatory bowel disease (IBD) and others (2,3). Therefore, the analysis of microorganisms and their interaction with the human body is an important task in understanding and treating microbiome-associated diseases.
Recent advancements in DNA sequencing technologies have propelled new methods for microbial analysis. Of special interest are metagenomic pipelines that provide deeper insights into local communities of organisms. These methods helped to identify thousands of uncultivatable species (4-6), further expanding the knowledge of environments such as the human gut.
To identify species in a metagenomic study, sequenced DNA reads typically need to be queried against a database of annotated reference genomes. The objective is to map each read to the most likely genome of origin by finding the best matching location across all reference sequences. Substrings of length k, called k-mers, are commonly used for such tasks (7). Reference databases are constructed by indexing all reference genomes by their k-mers and are subsequently queried using the k-mers of a read to find the best matches. After candidate matches are identified, they are filtered either using a seed-and-extend approach or a taxonomy-based voting scheme in order to determine the best matching location for each read. This approach is used by many state-of-the-art mappers such as Kraken2 (8), MetaCache (9), or CLARK (10). However, recent breakthroughs of deep learning in language modeling (11) and bioinformatics (12,13) suggest the potential power of metagenomic analysis via neural networks.
Recently, deep learning methods have been explored for sequence read mapping and abundance estimation to potentially overcome the reliance on frequently changing yet incomplete reference databases as well as the high memory consumption of state-of-the-art mapping approaches. Using convolutional neural networks, GeNet (14) shows the applicability of deep learning to mapping tasks while reducing the memory footprint. However, both accuracy and execution speed are inferior compared to state-of-the-art mappers. Meta2 (15) further explored different embedding schemes as well as deep learning techniques such as self-attention and deep sets. As a result, Meta2 outperforms GeNet, but still performs worse than traditional approaches.
Liang et al. proposed DeepMicrobes (16), a deep learning-based approach for metagenomic read classification. By using a multi-layer and multi-architecture model (embedding, bidirectional long short-term memory (LSTM), self-attention, as well as dense layers), DeepMicrobes is able to outperform state-of-the-art k-mer-based mapping approaches for species and genus identification in metagenomic gut data. However, this approach exhibits several shortcomings induced by the underlying LSTM architecture. The recurrent nature of such networks makes them more difficult to train (17) as well as less parallelizable due to the inherently sequential nature of LSTM units.
In 2017, Vaswani et al. (18) proposed the transformer, a neural network architecture that avoids any recurrent or convolutional components to solve tasks on sequential data. Instead, it relies on the self-attention mechanism to correlate all pairwise input tokens for later reweighting of the input sequence. The underlying inner product structure allows for straightforward parallelization of matrix products on massively parallel hardware, further accelerated by dedicated tensor core processing units.
Bertasius et al. (19) argued that the less restrictive self-attention mechanism reduces the inductive bias of the trainable map compared to plain convolutions that enforce strict translation equivariance of the input by construction. Self-attention further allows for mixing contributions of input tokens across the whole sequence, hence effectively covering broader receptive fields compared to convolutional cascades with thin filter kernels (20). Other desirable properties of transformers, such as faster convergence of large language models when increasing the number of trainable parameters (21,22) and their remarkable few-shot capabilities when trained in a self-supervised fashion, make them a popular choice for processing sequential data. In practice, the actual maximum sequence length is limited by the quadratic complexity of the attention mechanism. Alleviating this shortcoming is still part of current research. As of now, architectures based on the original idea of the transformer have achieved state-of-the-art performance in almost any natural language processing task (23-27).
Lerna (28), a transformer-based model on short- and long-read data for parameter tuning of error correction tools, generates language models of the uncorrected reads and builds a perplexity metric from these models. Using the perplexity metric, the corrected reads can be evaluated without the need for reference genomes. Further, the input embedding is discussed, deviating from character-level embeddings, as previously used in RNN approaches, to word-level embeddings, as the improved data complexity yields better results. As a result, the authors were able to speed up the error correction pipeline by 18x owing to the parallelization of the attention mechanism in transformer networks, the use of just-in-time compilation, and GPU utilization.
Given their great performance in language processing, applying transformer models to sequence-based bioinformatics tasks is a logical next step. Waele et al. (29) used transformer models on single-cell methylation data. Here, incompleteness or low coverage impedes the adoption of current single-cell DNA methylation protocols; therefore, a model that captures the underlying biological processes is needed to impute single-cell methylation. By combining axial attention with sliding window attention, the introduced CpG Transformer was able to learn the interaction between neighboring CpG sites. Furthermore, it was trained on noisy data, resulting not only in imputation but also denoising capabilities while outperforming other imputation approaches.
Another transformer-based approach was used for identifying bacteriophage contigs from metagenomic protein data. Here, protein-based tokens (protein clusters with high similarity) were fed into a transformer to find bacteriophage contigs in high-diversity/abundance data. Due to the adoption of transformer models and good negative samples for training, PhaMer was able to outperform other deep learning approaches and mappers on multiple datasets (30).
DNABERT applies a more sophisticated variant of the standard transformer to model different tasks on DNA data. It is based on the BERT language representation model, which achieved improvements on eleven natural language processing tasks and, with its improved versions, remains among the top-performing models to this date (31,32). BERT is trained in two stages: a pre-training step on an unlabeled training set with different tasks (next-sentence and masked-token prediction) and a fine-tuning step with labeled data and the desired task. Therefore, once pre-trained, the network can be fine-tuned for multiple downstream tasks (11). With BERT as a foundation, DNABERT is pre-trained on k-mer tokens (with k between 3 and 6) and fine-tuned for different tasks such as the prediction of transcription factor binding sites, proximal and core promoter regions, canonical and non-canonical splice sites, and several more, with good predictive performance (13). However, the resulting models are large, and training times for pre-training as well as fine-tuning are resource intensive. Furthermore, only small k-mers were tested, which could hamper the prediction capabilities, as both DeepMicrobes and DNABERT report better performance with larger k-mers.
In this paper we introduce MetaTransformer, a novel self-attention-based architecture for metagenomic read classification. Our goal is to improve the classification speed as well as the prediction accuracy of deep learning approaches for classification tasks. We reduce the memory footprint, improve the computational time for both inference and training compared to previous works, and analyze different embedding schemes. As a result, the model can even be executed on a single modern consumer-grade GPU. Furthermore, we explore additional optimization strategies such as sparse embedding gradients and mixed-precision computation. We subsequently evaluate our proposed model against DeepMicrobes as well as traditional k-mer-based methods.

MATERIALS AND METHODS
All datasets are summarized in Table 1.

Training data
For training of the species and genus classification models, sequence data provided by Almeida et al. (4) was used. It consists of 1,952 unclassified metagenomic species (UMGS) and 553 species from the human gut reference (HGR). The dataset will be referred to as HGR UMGS. UMGS data was obtained during the reconstruction of 92,143 metagenome-assembled genomes originating from 11,850 different human gut microbiomes (4). The quality of the metagenome-assembled genomes (MAGs) was assessed using CheckM (33). Only sequences exhibiting at least 50% completeness and less than 10% contamination were kept. For species-level training, all sequences spanning 2,505 classes were used. At genus level, only sequences with a genus assignment were kept. This resulted in 1,535 of the 2,505 MAGs contributing to genus-level training. A rough overview of the distribution of the ten most frequent genera can be found in Figure 1.
In-silico sequence read generation. All models were trained and evaluated using short-read data. Unfortunately, large-scale real-world datasets often lack the necessary ground truth labels needed for model optimization and evaluation. We thus used artificially generated read datasets stemming from HGR UMGS genomes by simulating reads using the ART Illumina read simulator (34). ART generates reads based on empirically determined profiles of sequencing technologies that include read error models as well as base quality values. For training, we generated 150 bp paired-end reads with an insert size of 400 bp and a standard deviation of 50 bp. The error model corresponds to the Illumina HiSeq 2500 sequencing technology. Paired-end reads were treated as single-end reads during training for all models.
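For illustration, the following Python sketch shows how such a simulation could be scripted; the flag names follow the art_illumina command-line documentation, the parameter values mirror the ones stated above, and the file paths and helper function are placeholders rather than part of the MetaTransformer pipeline.

# Minimal sketch: simulate paired-end reads with ART Illumina for one genome.
# Flag names follow the art_illumina documentation; file paths are placeholders.
import subprocess

def simulate_reads(genome_fasta: str, out_prefix: str, coverage: float) -> None:
    cmd = [
        "art_illumina",
        "-ss", "HS25",        # HiSeq 2500 error/quality profile
        "-i", genome_fasta,   # reference genome (FASTA)
        "-p",                 # paired-end mode
        "-l", "150",          # read length (bp)
        "-m", "400",          # mean insert size (bp)
        "-s", "50",           # insert size standard deviation (bp)
        "-f", str(coverage),  # fold coverage for this genome
        "-o", out_prefix,     # output prefix (produces <prefix>1.fq / <prefix>2.fq)
    ]
    subprocess.run(cmd, check=True)

# Example (hypothetical paths): simulate reads at 3x coverage for one training genome.
# simulate_reads("genomes/UMGS1.fa", "reads/UMGS1_", coverage=3.0)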
Prior to training, we shuffled all reads randomly. Shuffling was performed to provide a diverse set of species or genera in each training batch in order to avoid convergence to local optima during the first few iterations. We used the same set of source reference genomes for the validation data, but selected a different random seed for read generation. Shuffling was also applied to the validation data. Separate training and validation datasets were created for each taxonomic level of interest, i.e. species and genus. The number of generated reads was scaled to allow the models to converge with respect to the validation data without reusing the same reads multiple times during training.
To prevent class imbalance effects, the coverage needed to be adapted to compensate for varying genome sizes. In order to obtain a balanced training dataset, we used the following approach (see the sketch after this list):
• Input: N genomes G_1, ..., G_N and coverage factor f
• Determine the genome lengths L_1, ..., L_N and the maximal genome length L_max
• Determine the per-genome coverage by scaling f with L_max/L_i, so that each genome contributes approximately the same number of reads
Since we have control over the number of samples generated and are not restricted by a fixed-size dataset, we opted for dataset-level balancing. A coverage factor of 3 was used for the training sets and a factor of 1 for validation.
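The sketch below illustrates this dataset-level balancing; the exact per-genome coverage formula is an assumption (c_i = f * L_max / L_i) reconstructed from the description above, not necessarily the authors' exact implementation.

# Sketch of dataset-level balancing: scale per-genome coverage so that every
# genome contributes roughly the same number of reads. The formula c_i = f * L_max / L_i
# is an assumption reconstructed from the description in the text.
from typing import Dict

def balanced_coverages(genome_lengths: Dict[str, int], f: float) -> Dict[str, float]:
    l_max = max(genome_lengths.values())
    return {name: f * l_max / length for name, length in genome_lengths.items()}

# With read length r, each genome then yields about c_i * L_i / r = f * L_max / r reads,
# independent of its own length.
# coverages = balanced_coverages({"G1": 2_000_000, "G2": 5_000_000}, f=3.0)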

Evaluation data
Human gut benchmark data. To evaluate the trained models, we used two benchmark datasets provided by the authors of DeepMicrobes (16). The first one contains 3,269 high-quality MAGs obtained from an assembly (accession ERP108418) (4), of which only MAGs with at least 90% completeness, 0% contamination, and 0% strain heterogeneity were kept. For labeling, the following steps were applied: First, MASH (35) was employed to identify the closest corresponding pairs between benchmark and training genomes with the lowest MASH distance. Those candidates were subsequently globally aligned using MUMmer (36). Benchmark genomes exhibiting at least 60% AQ (fraction of query aligned to reference) and 95% ANI (average nucleotide identity in the aligned fractions) with respect to the alignment were labeled with the same species as the matched training genome. For the resulting 3,269 genomes, 10,000 in-silico paired-end reads per genome were generated in the same manner as the training reads. Since the training MAGs originate from the same assembly, only MAGs which are not used for training were included. For further information, the script for labeling the MAGs is available at https://github.com/Finn-Lab/MGS-gut. The dataset will be referred to as BenchmarkGut.
Mock community data. We used 10 mock communities created by Liang et al. (16). To generate these communities, whole genome sequenced isolates from two studies (ERP105624 and ERP012217 (37)) were used, of which N = 258 samples with genus-level annotations were selected. The abundance of each member of a community was drawn from a log-normal distribution with mu = 1 and sigma = 2.
Let A_1, ..., A_N denote the observations drawn from this distribution. The mock communities contain M = 10,000,000 samples. Each observation A_i can be transformed into an actual read count R_i by normalizing its value by the sum of observations and subsequently scaling according to the desired read count, R_i = M * A_i / sum_{j=1}^{N} A_j. A different random seed for drawing from the log-normal distribution was used for each mock community. Reads of length 100 bp were generated with ART using the same parameters as for the BenchmarkGut dataset. The resulting dataset will be referred to as BenchmarkMock.
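A minimal sketch of this construction follows; the rounding of the scaled abundances to integer read counts is an assumption, and function and parameter names are illustrative.

# Sketch of how a mock community's read counts could be derived: draw per-genome
# abundances from a log-normal distribution and rescale them to M total reads.
import numpy as np

def mock_read_counts(n_genomes: int, total_reads: int = 10_000_000,
                     mu: float = 1.0, sigma: float = 2.0, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    a = rng.lognormal(mean=mu, sigma=sigma, size=n_genomes)  # abundances A_1..A_N
    r = a / a.sum() * total_reads                            # R_i = M * A_i / sum(A)
    return np.rint(r).astype(int)                            # integer read counts

# counts = mock_read_counts(n_genomes=258, seed=1)  # one mock community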
Species absent from reference data. The absent species dataset created by Liang et al. is based on 7,903 species genomes accessible via NCBI BioProject accession PRJNA348753 (38). These genomes were aligned against the 2,505 genomes used for training. After aligning, genomes with an ANI of < 95% and an AQ < 60% to the closest relatives as well as an overall AQ > 10% were classified as 'absent', resulting in 121 genomes. A new dataset was simulated from the 'absent' genomes using ART as described in the subsection In-silico sequence read generation. As for the validation data, we used a new seed and a coverage of one. The dataset is referred to as AbsentSpecies.

Multi-omics inflammatory bowel disease database.
To test the usability of our approach on real-world data, we downloaded 106 microbiomes from subjects with and without inflammatory bowel disease via the SRA BioProject PRJNA398089 (39). Thereby, we used the same randomly chosen samples as DeepMicrobes, consisting of 26 healthy subjects, 50 subjects with Crohn's disease (cd) and 30 subjects with ulcerative colitis (uc). All samples were preprocessed using Trimmomatic (v0.33) (40) with a minimum read length of 75 bp and KneadData (v0.12.0) to remove host reads. After preprocessing, we calculated the abundance using MetaTransformer. To analyze differential microbial expression in the three groups, we used the linear discriminant analysis effect size (LEfSe) (41). This consists of a non-parametric Kruskal-Wallis test to determine, for each species, whether at least one group (uc, cd or control) is significantly (P < 0.05) different from another, and a linear discriminant analysis (LDA) model built on these significant species to assess their effect size. Species with an LDA effect size of > 2.0 were considered to be significant.

Transformer-encoder network
In the following, we introduce our encoder-only architecture. Consider an input sequence of embedded tokens S = [x_1, ..., x_n]^T in R^{n x d_model} extracted from a sequencing read. Here, n denotes the number of tokens and d_model the dimension of the embedding vectors. Different embedding techniques (LSH-EmbedPool (15), Hash-EmbedPool (15), k-mer embedding (16)) as well as different tokenizations (character-level, byte-pair (42), k-mer) were tested. Both tokenizations and embedding schemes are further explained in the subsection Embedding Schemes and Tokenization as well as in Supplementary Figure S1.
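To make the tokenization concrete, the following is a minimal sketch of k-mer tokenization into vocabulary indices; the base-4 index encoding and the canonical-k-mer handling are illustrative assumptions and may differ from the exact implementation used here.

# Sketch of k-mer tokenization: map each overlapping k-mer of a read to an
# integer index in a vocabulary of size 4^k. Base-4 encoding and canonical
# k-mers (minimum of forward and reverse complement) are assumptions.
from typing import List

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = str.maketrans("ACGT", "TGCA")

def kmer_index(kmer: str) -> int:
    idx = 0
    for b in kmer:
        idx = idx * 4 + BASE[b]
    return idx

def tokenize(read: str, k: int) -> List[int]:
    tokens = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if any(b not in BASE for b in kmer):               # skip ambiguous bases such as N
            continue
        canon = min(kmer, kmer[::-1].translate(COMP))       # canonical representation
        tokens.append(kmer_index(canon))
    return tokens

# tokenize("ACGTACGTACGTA", k=12) -> a sequence of indices in [0, 4^12)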
Positional encoding. Before the embedded tokens are passed to the next layer, they are combined with a positional encoding to inject the notion of position into the token embeddings. Hence, the position of sequence elements can be preserved without recurrence or convolution. The positional encoding can be a scalar or a vector for each element of the sequence and can be fixed or learned during training.
Here, we use a fixed positional encoding based on sine and cosine functions (18). For a position p in {1, ..., n}, the positional encoding is a vector of dimension d_model defined as

P_{p,2j} = sin(p * f_j),  P_{p,2j+1} = cos(p * f_j),

where p denotes the current position in the sequence and j indexes the sine/cosine pair of the positional encoding vector for position p. The frequency of the sinusoidal functions is given by f_j, which follows a geometric progression from 1 to 1/10,000. Therefore, the wavelengths increase monotonically from 2*pi to 2*pi * 10,000 over the dimensions of the positional encoding vector. The resulting positional encoding is robust with respect to uniqueness, it is symmetrically bounded between -1 and 1, and the network can learn to attend to relative positions since an offset can be expressed as a linear mapping. The complete encoding for all sequence positions forms a matrix P in [-1, 1]^{n x d_model}. It is added pointwise to the scaled token embeddings, resulting in a final input encoding E = S * sqrt(d_model) + P. The result E is passed to the next layer, which is the first of potentially multiple encoder blocks. An overview of an encoder block is shown in Figure 2.
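A minimal PyTorch sketch of this fixed sinusoidal encoding (with positions indexed from 0 for convenience) is given below.

# Fixed sinusoidal positional encoding: sine/cosine pairs whose frequencies form
# a geometric progression from 1 down to 1/10000, added to the scaled embeddings.
import math
import torch

def positional_encoding(n: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(n, d_model)
    pos = torch.arange(n, dtype=torch.float).unsqueeze(1)             # p = 0..n-1
    freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                     * (-math.log(10000.0) / d_model))                # geometric progression
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe                                                         # shape (n, d_model)

# E = S * math.sqrt(d_model) + positional_encoding(n, d_model)  # input to the first encoder block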
Multi-head attention. The first and most important element of the encoder block is the multi-head self-attention module. We first describe the single attention-head variant before extending it to multiple heads. Given the input encoding E in R^{n x d_model}, three different representations called query (Q), key (K) and value (V) are derived by means of linear transformations of E given by

Q = E W_Q,  K = E W_K,  V = E W_V,

where W_Q, W_K, W_V in R^{d_model x d_model} are trainable parameters of the network. The self-attention mechanism is given by

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_model) + M) V.

The dot products between the sequence elements in Q and K indicate the 'similarity' of the respective elements. The softmax function is subsequently applied row-wise to Q K^T in R^{n x n}. The resulting matrix contains a set of weights in each row used to compute the weighted sum of the input values V in R^{n x d_model}. The optional mask matrix M in R^{n x n} can be used to manually restrict the network from attending to specific tokens of the sequence, for example positions that correspond to padding. To mask the token at position p so that it cannot be attended, the p-th column of M is set to -inf. The scaling factor 1/sqrt(d_model) works analogously to standard dot product attention (43). With large values of d_model, the dot product produces larger values, possibly leading to extreme results after the softmax function is applied, which in turn could lead to vanishing gradients. To counteract this problem, the values of the dot product are scaled by 1/sqrt(d_model). To allow the model to attend to information based on varying patterns, multiple attention heads are employed.
They are computed as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O,  with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where h denotes the number of attention heads and each attention head i maintains its own set of trainable weights W_i^Q, W_i^K, W_i^V. Figure 3 illustrates our multi-head self-attention design. Before passing the result Z to the next layer, the per-head results are concatenated and linearly transformed by W_O to retain an equal dimension of d_model throughout the whole architecture.
Adding more heads inevitably increases the computational complexity of the layer. To counteract this problem, each attention head operates on an attention subspace whose size inversely scales with the number of attention heads h. By setting d_k = d_v = d_model / h, the complexity of the layer is invariant with respect to the number of attention heads. The result Z is added to the original input E through a residual connection, Z = Z + E, before being passed to a layer normalization (44) over the feature dimension, yielding the normalized sequence N. Layer normalization is further explained in the subsection Layer Normalization in the Supplement.
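A minimal PyTorch sketch of the multi-head self-attention described above follows; all heads are packed into single linear layers, masking is omitted for brevity, and this is an illustrative re-implementation rather than the authors' code.

# Multi-head self-attention: each head attends in a subspace of size d_k = d_model / h,
# the head outputs are concatenated and projected back to d_model by W_O.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # packs W_i^Q of all heads
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # output projection W_O

    def forward(self, e: torch.Tensor) -> torch.Tensor:      # e: (batch, n, d_model)
        b, n, _ = e.shape
        def split(x):                                         # -> (batch, h, n, d_k)
            return x.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(e)), split(self.w_k(e)), split(self.w_v(e))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # scaled dot products
        z = F.softmax(scores, dim=-1) @ v                     # weighted sum per head
        z = z.transpose(1, 2).reshape(b, n, self.h * self.d_k)  # concatenate heads
        return self.w_o(z)                                    # Z, before residual + layer norm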
The normalized elements of the sequence N are subsequently passed to a (fully connected) multilayer perceptron (MLP) with a single hidden layer that is applied to each position of the sequence independently,

F = MLP(N) = sigma(N U_1 + c_1) U_2 + c_2,

where sigma denotes the activation function and U_1 in R^{d_model x d_ff}, U_2 in R^{d_ff x d_model}, c_1 in R^{d_ff} and c_2 in R^{d_model} are trainable parameters. d_ff denotes the hidden dimension of the MLP. The final output of the encoder block is computed by applying another residual connection followed by layer normalization,

O = LayerNorm(F + N).

This encoder block design can be stacked multiple times to generate high-level representations. After the last encoder block, the output is fed into a linear classifier. Since O is a sequence of representations, it has to be transformed into a single vector which can be used with a fully-connected layer. The first option is to simply concatenate all vectors of the sequence into a single long vector. This is sub-optimal since it depends on the sequence length, producing non-fixed-size outputs. The second option is to take the mean along the sequence dimension across all sequence elements. The third option is to prepend a special classification token <cls> to each sequence and solely use it for classification. Since the 0th element of the sequence will then always be used for classification, the network learns to compress all information relevant for classification into this vector.
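For reference, this post-norm block layout (attention and feed-forward sub-layers, each followed by a residual connection and layer normalization) closely matches PyTorch's built-in encoder layer, so a stack of such blocks could be assembled as follows; the hyperparameter values are placeholders, not the values used in this work.

# Assemble a stack of encoder blocks with PyTorch's built-in post-norm encoder layer.
import torch.nn as nn

d_model, h, d_ff, num_blocks = 256, 8, 1024, 2     # placeholder hyperparameters
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=h,
                                   dim_feedforward=d_ff, dropout=0.1)
encoder = nn.TransformerEncoder(block, num_layers=num_blocks)
# Input shape: (sequence length n, batch, d_model); the output keeps the same shape.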
To summarize, the reduced representation O_red can be derived from the final sequence of representations by one of

O_red = Concat(O_1, ..., O_n)  (concatenation),
O_red = (1/n) sum_{i=1}^{n} O_i  (mean),
O_red = O_0  (cls token).   (11)

For the concatenation option, a trainable parameter reduces the concatenated vector back to a single vector of size d_model; therefore, O_red is of size d_model for every method of reduction. The actual classification is then performed by a linear classifier,

y_hat = O_red W_1 + b_1,

where W_1 in R^{d_model x m} and b_1 in R^m are trainable parameters and m denotes the number of output classes. For the choice of hyperparameters we refer to (18,45). Further, the memory consumption of the embedding and the transformer are important factors to be taken into account to meet the memory capacity constraints of GPUs. An overview of the hyperparameters that we evaluated can be found in Table 2.
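A small sketch of the three reduction options and the final linear classifier is given below; the trainable reduction layer for the concatenation case is omitted for brevity.

# Reduce the final sequence of representations O (shape (n, d_model)) to a single
# vector and classify it; "m" is the number of output classes.
import torch
import torch.nn as nn

def reduce_sequence(O: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    if mode == "concat":                 # depends on sequence length (non-fixed size)
        return O.flatten()
    if mode == "mean":                   # average over the sequence dimension
        return O.mean(dim=0)
    if mode == "cls":                    # use the prepended <cls> token only
        return O[0]
    raise ValueError(mode)

# classifier = nn.Linear(d_model, m)             # y_hat = O_red W_1 + b_1
# y_hat = classifier(reduce_sequence(O, "mean"))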

Multi-level classification
While models like DeepMicrobes perform well at a single taxonomic rank, it is often required to perform classification at multiple taxonomic levels. For example, if a read cannot be classified at species level with sufficiently high confidence, the user would still be interested in the genus-level classification. In order to predict at multiple ranks, we can add more classification layers after the backbone of the model. The backbone constitutes the part of the model that generates an encoded representation from the input data. Each classification head operates independently from the other heads. Let x_b in R^{dim_b} be the output of the backbone for a single input instance x with dimension dim_b. Let furthermore M = (M_1, ..., M_N) denote the numbers of output classes at N different taxonomic levels. Then each classification head H_i produces an output

y_hat_i = MLP_i(x_b),

where y_hat_i in R^{M_i} denotes the unnormalized class scores over the M_i output classes and MLP_i is the corresponding multilayer perceptron with input and output dimensions dim_b and M_i, respectively. The MLP can exhibit an arbitrary number of hidden layers of suitable dimensions.
Rojas-Carulla et al. (14) augmented this basic idea by introducing interconnections between the classification heads. This has the advantage that higher-level layers can directly inform lower-level layers about their prediction. Let x_b in R^{dim_b} again denote the representation of input x. The prediction at level i of the multi-headed interconnected classification layer is described by

y_hat_i = MLP_i(x_b) + T_{i-1} y_hat_{i-1},

where y_hat_i in R^{M_i} denotes the prediction and MLP_i the respective MLP at level i. To use the prediction y_hat_{i-1} from the previous level, its dimension is transformed using a trainable linear layer T_{i-1} in R^{M_i x M_{i-1}}, and T_0 denotes the zero matrix. Our proposed classification scheme is illustrated in Figure 4.
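A minimal sketch of such interconnected heads follows; using a single linear layer per head is a simplification of the MLPs described above, and the additive combination of the transformed previous prediction is reconstructed from the dimensions given in the text.

# Interconnected multi-level classification heads: each head predicts its own
# taxonomic level and receives the previous level's prediction via a trainable map T.
import torch
import torch.nn as nn

class MultiLevelHeads(nn.Module):
    def __init__(self, dim_b: int, class_counts: list):   # e.g. [n_genera, n_species]
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim_b, m) for m in class_counts)
        # links[i-1] maps the prediction of level i-1 into the class space of level i
        self.links = nn.ModuleList(
            nn.Linear(class_counts[i - 1], class_counts[i], bias=False)
            for i in range(1, len(class_counts)))

    def forward(self, x_b: torch.Tensor) -> list:
        outputs = []
        for i, head in enumerate(self.heads):
            y = head(x_b)
            if i > 0:                       # inform this level with the previous prediction
                y = y + self.links[i - 1](outputs[-1])
            outputs.append(y)
        return outputs                      # unnormalized class scores per taxonomic level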
During training, the classification heads are trained jointly. Therefore, the loss function needs to be adapted to account for the newly introduced heads. Let Y = (y_1, ..., y_N) and Y_hat = (y_hat_1, ..., y_hat_N) be the ground-truth taxa and the predicted taxa at each taxonomic level, respectively. In the following, we assume that a cross-entropy loss is used at each level i, so the joint loss is defined as

L(Y, Y_hat) = sum_{i=1}^{N} CE(y_i, y_hat_i),  with CE(y_i, y_hat_i) = -sum_{c=1}^{M_i} y_{i,c} log softmax(y_hat_i)_c.   (14)

Note that any loss function suitable for multi-class prediction could be used instead. A problem that is reintroduced by multi-level classification is class imbalance. While it is possible to establish class balance at a single taxonomic level by sampling an equal proportion of reads per class during read generation, the same cannot be achieved for multiple levels simultaneously. Consider a dataset balanced at the species level: when performing genus-level prediction on this dataset, the species will in general not be assigned to the genera in equal proportion, and the same pattern continues at higher levels. To avoid this issue, weighting factors (w_1, ..., w_N) with w_i in R^{M_i} are introduced. Each w_i counteracts over- or under-represented classes at the respective level i. The cross-entropy loss at level i is then altered to

CE_w(y_i, y_hat_i) = -sum_{c=1}^{M_i} w_{i,c} y_{i,c} log softmax(y_hat_i)_c.

Each w_i needs to be calculated once on the training data before training; the weights w_i are the normalized inverse counts of the classes at level i. In practice, interconnected multi-headed classification should perform at least on par with plain multi-headed classification and can potentially perform better, since higher-level predictions can inform lower levels about their decision.
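A sketch of this weighted multi-level loss, with weights derived as normalized inverse class counts, is shown below.

# Weighted multi-level loss: one cross-entropy term per taxonomic level, with
# per-class weights set to the normalized inverse class counts of the training data.
import torch
import torch.nn as nn

def class_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    w = 1.0 / counts
    return w / w.sum()                     # normalized inverse counts

def multi_level_loss(predictions: list, targets: list, weights: list) -> torch.Tensor:
    # predictions[i]: (batch, M_i) scores, targets[i]: (batch,) class ids
    return sum(nn.functional.cross_entropy(p, t, weight=w)
               for p, t, w in zip(predictions, targets, weights))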

Training
All models were trained using the Adam optimizer (46) with an initial learning rate of 0.01. When training a model with sparse embedding gradients, the weights of the embedding were optimized separately by SparseAdam, a PyTorch (47) implementation of Adam suited for sparse gradients. Since we are targeting classification with more than two classes, we used the cross-entropy loss as cost function, as seen in Equation 14. We tracked the loss on the validation data to determine the training progress and prevent overfitting. For all models, we used a batch size of 2,048. Species- and genus-level models were trained for a maximum of 300,000 or 500,000 steps. Early stopping was used when the model did not improve on the validation data for more than 50,000 steps. The training data was read sequentially from the FASTA files through a custom implementation of the IterableDataset class from PyTorch. This drastically reduces RAM usage since only the data required for the current batch needs to be held in memory. We used a PyTorch DataLoader with 8 to 16 worker processes and a pre-fetch factor of 2 to saturate the GPU during training. Training inputs are raw DNA reads, which are subsequently transformed using custom implementations of PyTorch transform objects. This includes tokenization into characters, k-mers or sub-words such as byte-pair encoding (BPE). Furthermore, naive hashing and locality-sensitive hashing (LSH) are also implemented as transforms since they operate on k-mer indices. We implemented LSH as a Cython extension (48). In this manner, we operate on C strings and can directly call the C++ implementation of MurmurHash3 (49) for hashing. We also added the option to use mixed-precision computation for an additional speedup and lower memory requirements during training and inference. Major frameworks already include respective modules to easily add mixed-precision capability; in our case, we use the AMP (automatic mixed precision) module shipped with PyTorch.
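The optimizer split and mixed-precision loop could look roughly as follows; the attribute name model.embedding and the loop structure are assumptions for illustration, and combining GradScaler with SparseAdam is shown as one possible arrangement rather than the authors' exact setup.

# Split parameters between SparseAdam (embedding, sparse gradients) and Adam
# (all remaining weights), and run the forward/backward pass under AMP autocasting.
import torch

def make_optimizers(model, lr=0.01):
    emb_params = list(model.embedding.parameters())      # embedding created with sparse=True (assumed attribute name)
    other_params = [p for n, p in model.named_parameters() if not n.startswith("embedding")]
    return (torch.optim.SparseAdam(emb_params, lr=lr),
            torch.optim.Adam(other_params, lr=lr))

def train_step(model, batch, labels, opt_sparse, opt_dense, scaler, loss_fn):
    opt_sparse.zero_grad(); opt_dense.zero_grad()
    with torch.cuda.amp.autocast():                       # mixed-precision forward pass
        loss = loss_fn(model(batch), labels)
    scaler.scale(loss).backward()
    scaler.step(opt_dense)                                # dense weights via Adam
    scaler.step(opt_sparse)                               # embedding weights via SparseAdam
    scaler.update()
    return loss.item()

# scaler = torch.cuda.amp.GradScaler()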

Inference
Similar to training, DNA reads constitute the input during inference. Here, the same set of transformations used in training was applied. We used paired-end reads during testing to improve resolution. Before inference, paired-end sequences were interleaved. The model then performs the prediction separately on each of the two sequences. Afterwards, the results were averaged component-wise before applying the softmax function. To saturate the GPU during inference, we used the dataloader with up to 80 threads.
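A minimal sketch of this paired-end scoring scheme follows.

# Score each mate separately, average the unnormalized class scores component-wise,
# and apply softmax to the average to obtain the prediction for the read pair.
import torch

@torch.no_grad()
def predict_pair(model, read1: torch.Tensor, read2: torch.Tensor) -> torch.Tensor:
    scores = (model(read1) + model(read2)) / 2     # component-wise average of class scores
    return torch.softmax(scores, dim=-1)           # class probabilities per read pair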

Evaluation metric
In order to evaluate the efficacy of our model, we employ a range of quantitative measures. For assessing the test data, we utilize precision and recall metrics defined as:

Precision_read = # reads classified correctly / # reads classified,
Recall_read = # reads classified correctly / # reads.   (17)

Moreover, we incorporate the L2 distance metric and employ the linear discriminant analysis effect size (LEfSe) (41) for conducting differential abundance analysis. The LEfSe tool utilized in our analysis was obtained from the Huttenhower Galaxy Server (available at http://galaxy.biobakery.org/). For the Kruskal-Wallis test, we set a significance threshold of 0.05, and the logarithmic linear discriminant analysis (LDA) score threshold for identifying discriminative features was set at 2.0. Additionally, we applied per-sample normalization of the sum of the values to 1 million.

Hardware and software
Training and benchmarking were conducted on a standalone workstation featuring two Intel Xeon Gold 6238 processors with 22 cores each and 188 GB of RAM. Additionally, it contained an NVIDIA Quadro GV100 GPU with 32 GB HBM2. We used Python 3.8 for training and experiments. All models were run using PyTorch 1.8.1 and CUDA 11.2. The byte-pair encoding was generated using the BPE model from the Hugging Face tokenizers library. We used Jellyfish (50) to create k-mer vocabularies.

Embeddings and tokenizations
Performance of k-mer-based models. We tested our transformer-encoder-based model with k-mer tokenization, in the following referred to as Vocab. The k-mer size is a key factor for performance. Experiments on the LSTM-based architecture have shown that larger values of k are desirable, while accuracy drops rapidly for lower values of k. Our training was performed for k ranging from 7 to 13.
In contrast to Liang et al. (16), we were able to evaluate k = 13 for our transformer by using mixed precision, sparse embedding gradients, and a smaller model dimension d_model = 64. The results are summarized in Figure 5. Like DeepMicrobes, Vocab exhibits low precision and recall on the validation set for k < 11. The best-performing model uses k = 13. We suspect that larger values of k would be beneficial, especially when dealing with more classes. Unfortunately, increasing k also increases memory consumption, e.g. k = 14 exhausts 32 GB of GPU memory and could therefore not be tested on our GPU. Hence, for further experiments we focused on the k = 12 and k = 13 models. Furthermore, a Vocab model with k = 12 optimized for multi-level classification (Vocab ML) was trained. However, Vocab ML showed a lower precision and recall compared to Vocab k = 12 and k = 13 for the genus prediction. Precision and recall for Vocab k = 12, k = 13 and Vocab ML are summarized in Table 3.
Performance of character-level tokenization and byte-pair encoding. We tested character-level tokenization and byte-pair encoding schemes with our proposed architecture. For character-level encoding, each character is represented by an embedding according to its index in the vocabulary. Since the employed embedding only had few entries, namely the four bases A, C, G and T, we experimented with deeper architectures featuring more encoder blocks. We tested an architecture with an embedding size of 128 and 6 encoder blocks. Additionally, we used an embedding of size 64 and increased the number of encoder blocks even further, up to 12 blocks. The best result was achieved with a dimension of 128 and 6 encoder blocks at a precision of 0.764 and a recall of 0.265 on the HGR UMGS validation data. Compared to k-mer-based approaches, the performance is drastically lower. Naturally, with longer substrings the vocabulary size increases, allowing for more detailed pattern recognition (28).
For byte-pair encoding, we tested similar hyperparameters as for the k-mer-based networks with a vocabulary size of 2^22 and 2^23. In contrast to tokenization based on k-mers, subword-level tokenization produces much shorter sequences. This allowed us to increase the number of transformer encoder blocks up to 4 due to the reduced memory requirements. The best model with 1 block achieved a precision and recall of 0.891 and 0.643 on the HGR UMGS validation data. This is also significantly worse compared to the k-mer-based variants summarized in Table 3. We attribute the reduced performance to the fact that tokens in byte-pair encoding do not overlap. Therefore, single nucleotide changes may lead to entirely different subword tokens. Additionally, subword tokens do not allow for the creation of local canonical representations of a read as is the case with k-mers. As a result, we use the k-mer-driven approaches for further analysis.
Performance of LSH and hash embeddings. In contrast to Georgiou et al. (15), we propose a less memory-restrictive usage of LSH and hash embeddings. The memory reduction schemes should rather be used to enable larger k-mer sizes than to heavily reduce the memory footprint. Here, the transformer-encoder models based on the LSH or hash embedding schemes are referred to as LSH and Hash, respectively. For LSH we opted for b = 2^22 or b = 2^23 buckets and k = 15. For hash embedding we used a bucket size of b = 2^22, q = 6 hash weights per k-mer and a k-mer size of k = 13. The results of the respective schemes at genus level are summarized in Table 3. Both versions of LSH achieve reasonable performance on the validation data. LSH with only b = 2^22 buckets seems to generalize worse than with b = 2^23, as indicated by the validation loss. The Hash performance lies in between the two LSH variants. Overall, LSH and Hash show better precision and recall than their character-level and byte-pair counterparts. However, the k-mer embedding approaches without compression show an overall better performance despite using a lower k than LSH. Therefore, for the following genus analysis steps Vocab k = 12 and k = 13 were used; they are referred to as MetaT k12 and MetaT k13.
Species prediction. For species prediction, Vocab k = 12 and k = 13 were selected due to their good performance at genus level. Further, Vocab ML can also classify species sequences. Table 4 shows precision, recall, loss and training time for the three models. As for genus prediction, Vocab k = 13 shows the best performance on both the training and the validation set, followed by Vocab ML and Vocab k = 12. As a result, we used MetaT k12 and MetaT k13 for further species analysis, due to their good performance and good comparability to DeepMicrobes.

Performance of MetaTransformer on test data
MOCK community analysis. In the following, we analyze the performance of the different embedding schemes for genus estimation on the BenchmarkMock dataset. The best performing models during training, MetaT k12 and MetaT k13, were tested against current classification tools for genus and species classification on this dataset. As comparison metric, the L2 norm distance is used. The results of our different embedding schemes for genus abundance estimation are summarized in Figure 6. The best performing model for genus prediction was again the Vocab k = 13 model with an average distance of 0.00207, followed by the k = 12 variant (0.00239) and DeepMicrobes (0.00242). LSH with k = 15 and 2^23 buckets performed worst with 0.00400, even though it exhibited good performance during training. All in all, the results on the BenchmarkMock dataset reflect the performance seen on the validation set. Only Vocab shows a lower distance than DeepMicrobes, underlining the importance of a suitable embedding scheme for model performance. For genus classification of the other tools, we measure precision and recall as previously defined. All tools were built using the same genomes we used for training our model. Our transformer model outperformed DeepMicrobes in recall (0.911 vs 0.881) yet has a slightly lower precision (0.982 vs 0.985) for k = 13. Nevertheless, both deep learning models were outperformed by Kraken2 and CLARK. Centrifuge scored lowest; note that precision and recall are equal for Centrifuge because every read was classified. Precision and recall results are summarized in Figures 7 and 8.
MAG analysis. The BenchmarkGut dataset consists of 3,269 gut-derived high-quality MAGs. For each MAG, the closest corresponding species was calculated as previously described. We use this dataset to evaluate the species prediction capabilities of our approach. The precision and recall for species prediction are summarized in Figures 9 and 10. For the other tools, custom databases and taxonomic trees were generated to use the same species genomes as the deep learning approaches. Our approach outperforms DeepMicrobes in recall and precision (0.904 versus 0.896 and 0.780 versus 0.516) for k = 13, though it performs slightly worse than Kraken2 in prediction. The best tools overall for precision are CLARK and Kraken2 (0.924), while Centrifuge scores the best recall (0.901).
Absent species. Another important problem for metagenome analysis are species absent from the database, leading to high false-positive rates. Therefore, the AbsentSpecies dataset containing 121 species absent from all databases was generated (16). Figure 11 shows the total accumulated abundance for several classifiers. In this experiment, Kraken2 outperformed all classifiers with 0 false positives, followed by DeepMicrobes and our approaches (20,324 versus 111,229 and 147,316). Centrifuge mapped every read and hence performed worst with an abundance of 564,819.

Differential species abundance analysis on real-world data
We analyzed 106 sequenced gut metagenome fecal samples using our MetaTransformer model (MetaT k13). The samples are part of iHMP (39) and were collected from subjects with ulcerative colitis (uc), Crohn's disease (cd) and healthy specimens. We want to test whether newly discovered genomes are correlated with IBD and whether our model is suitable for real-world data. Therefore, we calculated abundances via our model and determined differentially abundant species using LEfSe.
After the Kruskal-Wallis test and LDA calculation, 28 species showed a differential abundance across the different groups. In particular, we found differential abundance in Alistipes, Anaerotruncus, Clostridium, Collinsella, Eubacterium, Lachnospira, Roseburia, Ruminococcus and unclassified species of the order Clostridiales. All species and their LDA scores are summarized in Figure 12. Of these 28 species, five (14207 7 83, GCF 000155855, GCF 000174195, UMGS365, UMGS368) also displayed differential abundance in DeepMicrobes. Further, most of the identified genera are identical between both tools, though DeepMicrobes found differences in Prevotella, Coprococcus, Alphaproteobacteria, Anaeromassilibacillus and Bacteroides. Overall, the identified species indicate a correlation to healthy, uc and cd subjects as reported for their genus or order. For example, UMGS175 (Anaerotruncus) shows a decreased abundance in cd compared to the control and uc samples, which is described by Zhuang et al. (51) for this genus. UMGS456 of the genus Ruminococcus shows a decreased abundance in both cd and uc compared to the control, following the same pattern as outlined in the literature (51,52). However, species-specific patterns have also been characterized, such as a lower abundance of R. obeum, R. callidus and R. lactaris in IBD (53), a higher abundance in cd for R. gnavus (53-56), and locus-specific differences with a lower abundance of Ruminococcus in the ileum and a higher abundance in the colonic region (56).
Clostridium in general shows a lower abundance in IBD specimens (52,56). Nevertheless, fluctuations have been shown for different species, such as a higher abundance in cd for C. difficile, a higher abundance in uc for C. clostridioforme (53), a higher abundance in healthy subjects for C. innocuum and C. ramosum (55), and a lower abundance in healthy subjects for C. leptum (53,55,56). We estimated a higher abundance in uc compared to cd and healthy subjects for UMGS129 (Figure 13A), while GCF 001078425 (Figure 13B) has the highest abundance for cd followed by uc. Also, species like C. innocuum and C. ramosum are described with the highest abundance in the control group (55), underlining the species specificity of differential abundances in the bowel flora.
UMGS368 (Figure 13C) belongs to the genus Alistipes and showed a higher abundance in healthy specimens compared to IBD subjects. The same results can be seen in previous studies for different species of Alistipes like A. shahii, A. finegoldii and A. putredinis (39,57) as well as in the DeepMicrobes study (16), which shows a comparable abundance for the same species.
Two species of Lachnospira show a significant difference between the groups. In both cases, uc has the highest mean abundance (Figure 13D). In general, Lachnospira shows a higher abundance in uc and cd compared to control subjects (52,58). However, differences between species of Lachnospira with a lower abundance in uc and cd are reported as well (56,59,60).
All in all, we found significant differential abundance between IBD and control specimens for newly discovered species. These new species might provide insights into the development as well as the diagnosis and treatment of IBD patients. The abundance densities for every significant differentially abundant species are listed in the Supplement (Supplementary Figures S2-S29).

Training and inference speed
Training. Model training times depend on the type of embedding scheme, the number of learning steps, performance boosters (AMP and sparse gradients), the model size, as well as the taxonomic level of training, i.e. genus or species. Tables 3 and 4 show the training times of the different embeddings for genus and species abundance prediction. The fastest trained genus model was MetaT k13 with around 10 hours of training time for 300,000 steps. With over 43 hours and 1,000,000 steps, the multi-level variant took the longest, though it learned genus and species prediction simultaneously. For species prediction, MetaT k13 with 16 hours was again the fastest, 4 h faster than the 12-mer variant with the same 500,000 steps. Another important factor is data preprocessing. DeepMicrobes needs a lengthy transformation step before working on the data. These steps are run via multiple programs using single- and multi-threaded parts. For the BenchmarkMock dataset, each of the ten MOCK samples took around 25 min to complete, which is about half the time needed for the evaluation itself. Furthermore, after the preprocessing steps the size of each file inflates 6-fold. These preprocessing embedding steps were ignored in the comparison. MetaTransformer calculates the embeddings on-the-fly using multi-threaded CPU processing and GPU-accelerated computing, greatly reducing space consumption and time.

DISCUSSION
Taxonomic classification tools are often based on alignment-free mapping approaches enabling the analysis of millions of reads with high accuracy. Deep learning has shown its potential in a variety of problems on sequence-based data (12,13,28-30). Therefore, new classification approaches were explored using deep learning, with good results for genus and species prediction even on non-curated databases (16). However, deep learning-based methods come with a high computational cost and lower throughput, especially when using bidirectional LSTMs, compared to state-of-the-art conventional metagenomic classifiers (8,16,61). Yet, with new advancements in deep learning, like complex transformer models (18) such as BERT (11) for language modeling, new approaches have the potential to improve metagenomic classification tasks. In this work, we introduced MetaTransformer, a transformer-based classification algorithm for the genus and species levels. We were able to train the network within 9-16 h and speed up inference by up to 5 times compared to the LSTM approach. Furthermore, we can match the classification performance of the previous approach, DeepMicrobes, in genus prediction and outperform it at species-level prediction on our test setups.
One limitation of our approach is updatability. Adding new species or genera would in principle require retraining of the entire network. Nevertheless, it is possible to retrain starting from the previous model while adding new species, as in incremental learning (62,63), or to retrain with an entirely new set of genomes in a transfer learning manner (64). Such approaches would greatly reduce the computational cost of changing the underlying database and need to be investigated further.
Unfortunately, we still underperform in terms of several metrics compared to conventional alignment-free approaches. Here, a more sophisticated approach like DNABERT needs to be explored and tested for metagenomics. Furthermore, more diverse input data for intensive training and testing is needed. Nevertheless, with such sophisticated models the learning times typically increase due to the additional complexity, and classification would take considerably more time. We tested the runtime of BERTax (65), a classifier based on BERT, resulting in an average throughput of 37 reads per second on the BenchmarkMock dataset, which is 500 times slower than our approach.
We also incorporated the memory-reducing LSH and hash-embedding schemes into our new architectures. Instead of solely focusing on the reduction of memory, we used them to enable larger k-mer sizes up to k = 15. Although we could not observe any significant improvements over the k-mer-based Tf Vocab version, both schemes performed comparably. When using Tf Lsh with a slightly restricted number of buckets (b = 2^22), reasonable performance can be achieved while maintaining a relatively low memory footprint. The investigation of character- and subword-level tokenization methods showed that both performed significantly worse when compared to k-mer-level encoding. Byte-pair encoding suffers from undetectable single nucleotide changes since subword tokens are generated in a consecutive and not in an overlapping manner.
To demonstrate the applicability of our approach, we analyzed a real-world metagenomic gut dataset with focus on IBD (39). We discovered potential candidate species among the uncultured species for characterizing uc, cd and healthy specimens, validating the importance of such datasets. Our results and the current research show a species-specific interaction between the microbiome and IBD (39,53,66), though a detailed view is still missing. Therefore, more research must be conducted on the use of uncultured species and their effect on overall gut health, possibly decoding the missing links in the emergence of IBD.
With the rapid growth of computational power and advances in deep learning, new approaches might become a viable strategy for taxonomic read classification alongside established alignment-free methods, especially for high-dimensional data and multiple analysis steps on big data problems. Hence, further research in the field is needed.

SUPPLEMENTARY DATA
Supplementary Data are available at NARGAB Online.

Figure 1. Composition of the HGR UMGS (4) microbial dataset. Counts at genus level for the HGR UMGS dataset based on 2,505 species. Only the top 10 genera are displayed; the remaining taxa are summarized as 'Rest'. The 'Unclassified' genera have a higher-order classification.

Figure 2. Transformer encoder block design. The encoder block consists of a multi-head self-attention module, followed by layer normalization, a pointwise feed-forward layer and another layer normalization. Before each layer normalization, a residual connection re-adds the input of the previous module.

Figure 4. Multi-level interconnected classification heads. Each classification head uses the representation generated by the model backbone to perform prediction. Additionally, higher-level predictions inform lower levels about their prediction by linearly transforming their output to match the output shape of the lower level.

Figure 5. Performance of Tf Vocab for increasing k-mer size. Precision, recall and loss for increasing values of k for the Tf Vocab architecture are displayed. Values were measured during training on the validation data based on the HGR UMGS dataset.

Figure 6. Average L2 norm of the genus abundance prediction error for different embeddings on the BenchmarkMock dataset. DeepMicrobes is used as a reference.

Figure 7. Precision of the genus abundance prediction for different classification tools on the BenchmarkMock dataset. Median is colored in black and mean in blue.

Figure 8. Recall of the genus abundance prediction for different classification tools on the BenchmarkMock dataset. Median is colored in black and mean in blue.

Figure 9. Precision of the species abundance prediction for different classification tools on the BenchmarkGut dataset. Median is colored in black and mean in blue.

Figure 10. Recall of the species abundance prediction for different classification tools on the BenchmarkGut dataset. Median is in black and mean in blue.

Figure 11. Total number of misclassified reads for different classification tools on the AbsentSpecies dataset.

Figure 12. LDA scores of significantly differentially abundant species. LDA score of every species with a significant (Kruskal-Wallis P < 0.05 and LDA score > 2) difference between two or more groups. cd stands for Crohn's disease, uc for ulcerative colitis and control for healthy subjects. In brackets is the closest taxonomic classification for each species.
Emergent AI Center, funded by the Carl-Zeiss-Stiftung; MetaDL project funded by the German Federal Ministry of Education and Research (BMBF) [01IS19071C]. Conflict of interest statement. None declared.

Table 1. List of all datasets.

Table 2. Hyperparameters for our transformer encoder models.

Table 3. Genus-level performance for different embedding schemes. Performance of the embedding schemes in combination with the transformer model at genus level. We only included the best performing variants for each scheme. Precision, recall and loss of the model on the training and validation set as well as the training time of each model are reported. b represents the number of buckets. All models were trained for 300,000 steps, except for LSH (k = 15, b = 2^23) with 420,000 steps, Vocab ML with 1,000,000 steps and Vocab (k = 12) with 500,000 steps.

Table 4. Vocab species-level performance. Performance of the Vocab embedding schemes in combination with the transformer model at species level. We only included the best performing variants for each scheme. Given are the precision, recall and loss of the model on the training and validation set as well as the training time of each model. Both k = 12 and k = 13 were trained for 500,000 steps, while ML is the same model as used at genus level, trained for 1,000,000 steps.
Figure 13. Normalized abundance density of the different groups for species UMGS129, GCF 001078425, UMGS368 and UMGS42. Crohn's disease is represented by cd, ulcerative colitis by uc and healthy subjects by control. The abbreviation after the genus is the significant group determined by the LEfSe.
Inference. Due to space limitations for the larger datasets BenchmarkMock and BenchGenus, both were split into multiple smaller files and processed successively. For the smaller species sets, multiple files were processed simultaneously. On the BenchmarkMock dataset, MetaT k13 outperforms DeepMicrobes. MetaT k13 uses a smaller model dimension than MetaT k12, while the structure is equal; therefore, calculation is faster, which is also seen in the faster training times. However, owing to its smaller overall size, MetaT k12 can process more species files simultaneously. Furthermore, model loading times are longer for the MetaT k13 model. The speedup and times for both benchmark datasets are summarized in Table 5.

Table 5. Species- and genus-level inference speed for BenchmarkMock. Inference speed at genus and species level of MetaT k12, MetaT k13 and DeepMicrobes on the BenchmarkMock dataset. The results are averaged over all 10 samples.